BEXCIS: Bayesian methods for estimating the degree of the skewness of X chromosome inactivation

Background X chromosome inactivation (XCI) is an epigenetic phenomenon that one of two X chromosomes in females is transcriptionally silenced during early embryonic development. Skewed XCI has been reported to be associated with some X-linked diseases. There have been several methods measuring the degree of the skewness of XCI. However, these methods may still have several limitations. Results We propose a Bayesian method to obtain the point estimate and the credible interval of the degree of XCI skewing by incorporating its prior information of being between 0 and 2. We consider a normal prior and a uniform prior for it (respectively denoted by BN and BU). We also propose a penalized point estimate based on the penalized Fieller’s method and derive the corresponding confidence interval. Simulation results demonstrate that the BN and BU methods can solve the problems of extreme point estimates, noninformative intervals, empty sets and discontinuous intervals. The BN method generally outperforms other methods with the lowest mean squared error in the point estimation, and well controls the coverage probability with the smallest median and the least variation of the interval width in the interval estimation. We apply all the methods to the Graves’ disease data and the Minnesota Center for Twin and Family Research data, and find that SNP rs3827440 in the Graves’ disease data may undergo skewed XCI towards the allele C. Conclusions We recommend the BN method for measuring the degree of the skewness of XCI in practice. The R package BEXCIS is publicly available at https://github.com/Wen-YiYu/BEXCIS. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04721-y.

the W SD 's and W IQR 's of the BN, BU and Fieller's methods increase while those of the PF method decrease as MAF gets bigger with n = 500 or n becomes larger with MAF = 0.1. For the quantitative trait, the W SD 's and W IQR 's of the BN and BU methods still turn to be larger while those of the PF and Fieller's methods become smaller for higher MAF with n = 500 or greater n with MAF = 0.1. Fixing MAF = 0.1, the W SD 's of the BN, BU and Fieller's methods increase while those of the PF method decline when the trait turns from qualitative into quantitative. When n = 500 and MAF = 0.1, if the trait turns from qualitative into quantitative, the W IQR 's of these four methods all increase. When n = 2,000 and MAF = 0.1, the W IQR 's of these four methods for the qualitative trait are similar to those for the quantitative trait. Except for the situations we discussed above, the W SD 's and W IQR 's of these four interval estimation methods generally become less with larger n, bigger MAF or the trait changing from qualitative to quantitative. The reason for these results can be listed as follows. The BN, BU and Fieller's methods generally obtain wide intervals while the PF method has more chance to get shorter CIs with many noninformative CIs when n = 500 and MAF = 0.1, especially for the qualitative trait (Additional file 2: Supplementary Figures S25-S27 and S36-S38). So, the BN, BU and Fieller's methods have smaller W SD 's and W IQR 's compared to the PF method when n = 500 and MAF = 0.1. The four methods all have more chance to get shorter intervals as n becomes larger, MAF gets higher or the trait turns from qualitative into quantitative, which tend to enlarge the variation of the BN, BU 1 and Fieller's methods but reduce that of the PF method. For the situations with larger n, higher MAF or the trait turns from qualitative into quantitative, all the four methods are likely to obtain shorter intervals, and increasing n, MAF or changing the trait from qualitative to quantitative may lead to smaller W SD 's and W IQR 's of the four methods.

Simulation settings with a covariate
Assume that the frequencies of the normal allele d and the deleterious allele D are q and p (p + q = 1), respectively.
Since the simulation results without any covariate in the Results section all indicate that ρ has little impact on the performances of the proposed methods, we only consider the situations where the inbreeding coefficient ρ = 0 for these additional simulations with a covariate. As such, we have the frequencies of genotypes dd, Dd and DD are (g 0 , g 1 , g 2 ) = (q 2 , 2pq, p 2 ). Then, we can simulate the samples of genotypes dd, Dd and DD from a trinomial distribution with probabilities (g 0 , g 1 , g 2 ). MAF (i.e., p) is set to 0.3 and 0.1, and the sample size n is fixed at 500 and 2,000. For the qualitative trait, where Z i is the covariate of female i which is simulated from the standard normal distribution and b is the corresponding regression coefficient. To fix the case-control ratio at 1 : 1, we keep sampling from the trinomial distribution until the number of the cases and that of the controls are both greater than n/2. Then, we randomly extract n/2 samples from the case group and the control group to create the final study sample of size n, respectively. We can accordingly get X 1i and X 2i for all the females. On the other hand, for the quantitative trait, we directly simulate the numbers n 0 , n 1 and n 2 with n 0 + n 1 + n 2 = n for genotypes dd, Dd and DD from the trinomial distribution with probabilities (g 0 , g 1 , g 2 ) and Y i is generated by For both the qualitative trait and the quantitative trait, β 0 , β and b are set to be 0, 0.3 and 0.5, respectively, and γ is randomly sampled from U (0, 2). For the quantitative trait, (σ 2 0 , σ 2 1 , σ 2 2 ) = (1, 1.2, 1). For each simulation setting, we conduct 500 replicates (i.e., 500 SNPs) and the confidence level (1 − α) is fixed at 95% for the frequentist methods.
To make the HPDIs comparable to the CIs, we calculate the 95% HPDIs for the Bayesian methods. The prior distributions of γ, β 0 , β, b and σ j (j=0, 1, 2) in the Bayesian methods are selected as follows: γ ∼ U (0, 2) and γ ∼ N (1, 1) ∈ [0, 2], β 0 ∼ N (0, 10 2 ), β ∼ N (0, 10 2 ), b ∼ N (0, 10 2 ) and σ j ∼ exp(1). We set 8 chains to extract the samples parallelly and simultaneously. We extract 20,000 samples in each chain, among which the first 10,000 samples are only used for warming up and are discarded when the sampling is finished. So eventually, we get 80,000 samples in total. The target acceptance rate is set to be 0.99 to ensure the convergence. The convergence diagnostiĉ R for Markov chains in the Bayesian method is done, and the calculatedR's in our Bayesian models are all less than 1.05 which indicates good convergence (data not shown). The simulation study is implemented by the R software (version 4.0.0).

Simulation results with a covariate
The proportions of the extreme values ofγ P F andγ F among the 500 replicates for the qualitative trait and the quantitative trait when (σ 2 0 , σ 2 1 , σ 2 2 ) = (1, 1.2, 1) with a covariate and ρ = 0 are presented in Additional file 3: Supplementary More importantly, the Bayesian methods also show their own advantages over the PF and Fieller's methods in both the point estimation and the interval estimation.